Skip to content

CASSANDRA-21131: Fix CSV COPY TO/FROM corrupting text values containing backslashes#4813

Open
Jens-G wants to merge 2 commits into
apache:trunkfrom
Jens-G:jens-g/CASSANDRA-21131/trunk
Open

CASSANDRA-21131: Fix CSV COPY TO/FROM corrupting text values containing backslashes#4813
Jens-G wants to merge 2 commits into
apache:trunkfrom
Jens-G:jens-g/CASSANDRA-21131/trunk

Conversation

@Jens-G

@Jens-G Jens-G commented May 15, 2026

Copy link
Copy Markdown
Member

Summary

COPY TO followed by COPY FROM corrupts text column values that contain backslashes: each round-trip doubles the backslash count. Reported in CASSANDRA-21131.

Before (one round-trip):

  • Stored: V\S → exported CSV: V\\\\S → re-imported: V\\S
  • Stored: \"Marianne"\ → re-imported: \\"Marianne"\\

list<text>, set<text>, map<text,text>, tuples and UDTs with text fields are affected in the same way.

Root Cause

format_value_text in formatting.py doubles backslashes unconditionally:

escapedval = val.replace('\\', '\\\\')

This is intentional for terminal display (SELECT output shows V\\S so the backslash is visible). However, ExportProcess.format_value in copyutil.py calls the same function when writing CSV. The csv.writer (configured with escapechar='\\') then escapes backslashes a second time, quadrupling them in the CSV file. On COPY FROM the csv.reader unescapes once, leaving doubled backslashes in Cassandra.

Fix

Add an escape_backslash parameter (default True, preserving existing terminal display behaviour) to format_value_text, format_simple_collection, and all collection formatters. Pass escape_backslash=False from ExportProcess.format_value so the csv.writer handles all backslash escaping exclusively.

Changed functions:

  • format_value_text — new parameter
  • format_simple_collection — new parameter, propagated to element format_value calls
  • format_value_list, format_value_set, format_value_tuple — new parameter, forwarded to format_simple_collection
  • format_value_map — new parameter, propagated through subformat
  • format_value_utype — new parameter, propagated through format_field_value
  • ExportProcess.format_value in copyutil.py — passes escape_backslash=False

Test Plan

Two standalone Python test scripts (no running Cassandra cluster required) are attached to the JIRA ticket and verify the bug and fix:

  • test_cassandra_21131.py — 10 test cases for plain text columns: 5/10 pass before fix → 10/10 after
  • test_cassandra_21131_collections.py — 12 test cases for list/set/map<text>: 3/12 before → 12/12 after

Integration testing against a live cluster with the exact scenario from the bug report (COPY TOTRUNCATECOPY FROMSELECT) is needed before merge.

Notes

  • A separate but related bug (UNICODE_CONTROLCHARS_RE converting control chars like \n to repr-notation \\n during CSV export) was discovered and will be tracked in a separate ticket.
  • The Generated-by: commit token is included per ASF generative tooling policy. The fix was developed with AI assistance (Claude Sonnet 4.6 / Anthropic) under human review and direction. All code has been verified manually.

🤖 Generated with Claude Code

…ng backslashes

format_value_text in formatting.py doubles backslashes for terminal display
(so SELECT output renders them visibly). When used via ExportProcess.format_value
for COPY TO, this pre-escaping is applied before csv.writer runs its own
backslash escaping (escapechar='\\'), resulting in quadrupled backslashes in the
CSV file. On COPY FROM the csv.reader unescapes once, leaving doubled backslashes
in Cassandra — data corruption that compounds on every round-trip.

The fix adds an escape_backslash parameter (default True, preserving existing
terminal display behaviour) and passes escape_backslash=False from the CSV
export path in ExportProcess.format_value. The parameter is propagated through
format_simple_collection, format_value_list/set/tuple/map, and format_value_utype
so that collection types (list<text>, set<text>, map<text,text>, UDTs) are
covered as well.

Generated-by: Claude Sonnet 4.6 (Anthropic) with human review and direction

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@bschoening

bschoening commented May 18, 2026

Copy link
Copy Markdown
Contributor

@Jens-G It seems the exporter is sending values through the display formatter, which doubles backslashes for human-readable SELECT output, before handing them to the CSV writer. Claude suggests just stop using the display formatter for CSV export of text in copyutil.py. What would you think of that approach?

def format_value(self, val, cqltype):
    if val is None or val == EMPTY:
        return format_value_default(self.nullval, colormap=NO_COLOR_MAP)

    # Text-like values: pass through unmodified. The csv.writer will apply
    # CSV-level escaping using the dialect's escapechar. Running these
    # through the display formatter would double-escape backslashes, see
    # CASSANDRA-21131.
    if cqltype.type_name in ('text', 'varchar', 'ascii'):
        return val
    ...

@Jens-G

Jens-G commented May 18, 2026

Copy link
Copy Markdown
Member Author

TBH if it works I'm fine with either approach 👍 🚀

@bschoening bschoening left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider an alternative approach where formated_value() bypasses formatting for text

if cqltype.type_name in ('text', 'varchar', 'ascii'):
return val

formatted = formatter(val, cqltype=cqltype,
encoding=self.encoding, colormap=NO_COLOR_MAP, date_time_format=self.date_time_format,
float_precision=cqltype.precision, nullval=self.nullval, quote=False,
escape_backslash=False,

@bschoening bschoening May 18, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider an alternative approach where formatted_value() bypasses display formatting for text.

format_value():
...
if cqltype.type_name in ('text', 'varchar', 'ascii'):
    return val if val.isprintable() else None

escape_backslash=False,
decimal_sep=self.decimal_sep, thousands_sep=self.thousands_sep,
boolean_styles=self.boolean_styles)
return formatted

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add unit tests

@arvindKandpal-ksolves

arvindKandpal-ksolves commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Hi @Jens-G , if you are busy I can add the unit tests for this and you can review that — just let me know if that works. ?

@bschoening

bschoening commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

@Jens-G the PR needs additional work. Based on our discuss of RFC 4180, proper text values don't require formatting. In format_value():

if cqltype.type_name in ('text', 'varchar', 'ascii'):
    return val if val.isprintable() else None

that may eliminate the need to have a new argument for escape_backslash

… handling

Add TestExportFormatValue to test_copyutil.py, covering ExportProcess.format_value:
- scalar text/varchar/ascii values keep single backslashes on export
- text inside list/set/map/tuple keeps single backslashes (the collection
  formatters propagate escape_backslash, which a scalar-only type check misses)
- backslashes survive a csv.writer -> csv.reader round-trip with the COPY dialect
- the terminal-display path still doubles backslashes (escape_backslash default),
  preserving SELECT rendering

The tests were confirmed to fail when the escape_backslash=False export change is
reverted, so they guard against regressing the fix.

Generated-by: Claude Opus 4.8 (Anthropic) with human review and direction
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Jens-G

Jens-G commented Jun 4, 2026

Copy link
Copy Markdown
Member Author

Thanks @bschoening! I worked through the scalar fast-path. It's a clean simplification for text/varchar/ascii, but I don't think it lets us drop the escape_backslash argument — here's the case it misses.

The ticket also covers list<text>, set<text>, map<text,text>, tuples and UDTs. Those aren't serialized as scalars: format_value_list/_set/_map/_utype format each element by routing it back through format_value_text (with quote=True). A cqltype.type_name in ('text','varchar','ascii') check in format_value() never sees them, so their elements still get their backslashes doubled. Since collection elements always pass through format_value_text, the no-double-escape signal has to live on that function regardless of what we do at the scalar level — so the argument can't be fully removed without dropping collection support.

I've added TestExportFormatValue in this PR to pin this down. The collection cases serialize list/set/map/tuple<text> through the export path and assert single backslashes; they fail (doubled) if the escape_backslash=False change is reverted — which is effectively what a scalar-only check does to collections. (On import there's no compensation either: the elements are CQL string literals, and dequote_value only un-doubles single quotes, not backslashes.)

On return val if val.isprintable() else None: with the parameter approach it's moot, but for the record else None would silently null the whole value for any text containing a newline or tab (it becomes an empty field). I understand the intent of keeping control characters out of the CSV — but that's separate from the backslash doubling, and format_value_text rewrites control chars via UNICODE_CONTROLCHARS_RE either way, so I'd keep it out of this ticket and handle it in the UNICODE_CONTROLCHARS_RE follow-up I mentioned.

Happy to add a scalar fast-path on top of the parameter if you'd like it for readability — as an addition rather than a replacement. WDYT?

@Jens-G

Jens-G commented Jun 4, 2026

Copy link
Copy Markdown
Member Author

@arvindKandpal-ksolves thanks a lot for offering to help! I've gone ahead and added the unit tests in this PR (TestExportFormatValue in test_copyutil.py), so this one is covered. Much appreciated. 🙏

@bschoening

bschoening commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

@Jens-G Based on what we learned with CASSANDRA-21381, we should broaden the scope here to handle all TEXTDATA properly.

COPY TO should allow roundtrip all TEXTDATA. Backslash is just another character in the %x2D-7E range - a perfectly ordinary TEXTDATA character with no special status so escape_backslash as an option isn't needed. The only escape production is DQUOTE.

  1. format_value: agreed, let's fast-path it — skip formatting for printable text and sanitize only the non-printable cases..

  2. control characters: agree that else None is too aggressive. Text containing TAB, CR, LF should be enclosed in double quotes (the CSV escaped form), in a sanitize field function. The remaining C0/C1 controls are outside the scope of RFC 4180; rather than support round-tripping them, I'd suggest a one-way, percent-encoded translation (%00, %1B, …) — non-reversible by design, and less easily confused with Python's own backslash escaping.

  3. CSV escaping implies that COPY FROM would revert DQUOTE escapes

  4. collections: agreed -- list, set, map<text,text> need the same sanitization and should be in scope here.

BTW, sineemore/csv-test-data has test fixtures we could cherry-pick for unit tests.

Relevant RFC 4180 grammer:

field = (escaped / non-escaped)

escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

non-escaped = *TEXTDATA

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

DQUOTE = %x22

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants